# Vision-language interaction
Qwen2-VL-72B-Instruct
Other
Qwen2-VL-72B-Instruct is a multimodal vision-language model that processes images and text together and is suited to complex vision-language tasks.
Image-to-Text
Transformers
English

FriendliAI
18
1
Pixtral 12B
Apache-2.0
Pixtral is a multimodal model based on the Mistral architecture, capable of processing both image and text inputs to generate detailed textual descriptions.
Image-to-Text
Transformers

mistral-community
31.93k
90
InternLM-XComposer2-VL-1.8B
Other
A large vision-language model based on InternLM2 with strong image-text understanding and composition capabilities.
Text-to-Image
Transformers

internlm
169
18
Featured Recommended AI Models